[57] ABSTRACT

Method and system for automatically determining and 
updating thresholds based on collected data samples for 
counter variables being monitored by a network manage- 
ment application. Historical data is accumulated and used to 
determine the mean and standard deviation of the monitored 
counter variables based on the aggregated sample values and 
a threshold factor is applied to the standard deviation and the 
resulting value added to the mean to establish the threshold 
value. The threshold value, which is adaptively updated, is 
used to determine whether the subsequently sampled values 
of the monitored counter variables are within a normal 
range, or that a potential problem exists requiring interven- 
tion by a network operator. 

14 Claims, 5 Drawing Sheets 
SYSTEM AND METHOD FOR AUTOMATIC
DETERMINATION OF THRESHOLDS IN
NETWORK MANAGEMENT

BACKGROUND OF THE INVENTION

Network management applications often determine conditions
in the network by monitoring particular streams of data, such
as Management Information Base (MIB) counters, gauges, and
network node states. Typically, to determine whether or not a
problem exists, these applications apply thresholds to the
rates of change in such counters, or to the values received in
gauges. A threshold is a specified number of occurrences of an
event within a specified time period which becomes the
triggering event for the performance of particular actions,
such as initiating specific recovery procedures for a failed
device in the network. Currently, the application user is
expected to manually determine and set such thresholds.

However, this current method of having the user manually
determine and set thresholds poses two serious problems
which significantly impact the usability of such an
application.

First, the user is not familiar enough with the meaning of
specific counters or gauges to be able to decide what a
reasonable threshold should be. Although some counters and
gauges are well-known and standardized, there is a significant
number of obscure and non-standard counters and gauges.
Typically, then, the user will disable the monitoring of such
data in the application due to lack of understanding (a loss
of potentially useful information), or will be forced to
investigate the meaning and usage of the data to be able to
determine a reasonable threshold (a loss of productivity), or,
in the worst case, will be forced to collect statistical data
about every counter or gauge in question to manually
determine a reasonable threshold (such statistical data could
take days to collect).

Second, it is typically the case that each node in the
network to be managed can generate hundreds of different
counters and gauges. If every counter and gauge has a
separate threshold (or set of thresholds), this implies that
there are hundreds of thresholds to be set for each node, and,
therefore, potentially thousands of thresholds to be set for a
managed network. Clearly, such a task is tedious, time
consuming, and, therefore, error prone when placed upon
the application user.

These two problems with manual threshold determination,
when combined, significantly impact the usability of a
network management application. The more heavily the
application depends on thresholding as a means of problem
determination, the more serious the degradation in usability
becomes.

In general, event counting and thresholding are well
known in the computing and networking arts. In U.S. Pat.
No. 4,[illegible]0,589, the occurrence of an error triggers a
timer and begins a counting interval. Subsequent errors
occurring during the interval are counted until a
predetermined threshold is reached or the timing interval
expires. An alarm is signalled and the timer is reset if the
threshold is reached before expiration of the time interval.
In U.S. Pat. No. 4,291,403, if the error count exceeds the
predetermined threshold during an established time period, an
alarm is generated and a second threshold is established to
measure subsequent error rates. In U.S. Pat. No. 4,339,657, a
variable time interval is established that is measured by the
occurrence of a predetermined number of operations. The
arrangement counts errors occurring during the operations and
also counts the number of times that the error count crosses a
predetermined threshold.

U.S. Pat. No. 5,223,827, having the same assignee as the
present invention, improves on other prior art by providing
a mechanism for managing network event counters that
[illegible] of information that can be manipulated to provide
a [illegible] of performance measurement. It makes use of an
event counter, a [illegible] event threshold counter and a
[illegible] interval counter for detection of an event
[illegible] some type of action in response.

While the [prior art] is useful in many instances,
particularly with [respect] to [determining] when
predetermined thresholds have been reached, the art does not
provide a system or method for automatically determining
threshold conditions to associate with the use of gauges or
counters in network management applications. The present
invention improves on prior art techniques by automatically
determining threshold conditions for a multiplicity of
counters or gauges corresponding to particular data streams.

SUMMARY OF THE INVENTION

It is an object of the invention to provide a more efficient
system and method for automatically determining and setting
thresholds for event counters and gauges in a communications
network without intervention by a network management
application user.

It is a further object of this invention to provide a system
and method for automatically determining and adaptively
updating thresholds based on data samples.

It is still a further object of this invention to provide a
system and method that enables the user to selectively
enable or disable automatic thresholding for individual
variables that are being monitored at a network node.

These and other objects and advantages are accomplished
by the invention in which information for every variable
being monitored is stored by the network management
application to collect historical behavior for use in
determining and updating a threshold for each monitored
variable. To determine if the collected variables or counters
are in a normal range, the mean and standard deviation are
determined each time the monitored variable is sampled and a
range factor is applied to this statistical data to establish
the threshold or set of thresholds for the monitored variable.

Historical data is updated whenever the variable is
sampled, so that the threshold is continuously adapting to
recent trends in the behavior of the variable. The network
operator or user can select automatic monitoring and
thresholding for entire sets of variables, or can
alternatively enable and disable monitoring for individual
variables. The invention further enables the user to disable
automatic thresholding at any time, which results in
individual thresholds being frozen at their current levels.

DESCRIPTION OF THE DRAWINGS

This invention will be described with respect to a preferred
embodiment thereof which is further illustrated and
described in the drawings.

FIG. 1 schematically illustrates an embodiment of the
invention environment for a RISC System/6000 processor
operating as the network management station for
communications to an SNMP-enabled communications network.

FIG. 2 is an operator interface screen display associated
with the network management application that identifies all
polled nodes of the network.

FIG. 3 is a node monitor screen display showing the status
of a particular node in the network.

FIG. 4 illustrates a flowchart describing the algorithm 
used for automatic threshold determination. 

FIG. 5 is a graphical display of the values of a threshold 
variable over a time interval during which sample values of 
the variable are collected or accumulated. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

The invention finds its application in present day complex 
heterogeneous communications and data processing net- 
works in which a variety of devices or products are managed 
by network management applications which, in turn, are 
monitored by network control operators. In this heteroge- 
neous computing environment use of local area networks 
(LANs) containing numerous and varied personal computers 
and workstations is widespread. A corporate computing 
environment may contain several LANs at a single site 
connected by bridges, or LANs at several different sites 
connected by routers into one or more wide area networks 
(WAN). 

To manage heterogeneous networks, management proto- 
cols such as industry standard Simple Network Management 
Protocol (SNMP) and Open Systems Interconnection Common
Management Information Protocol (OSI CMIP) have
been developed. In addition to the management protocol 
used to communicate between the managing system and the 
managed system or device, a management information base 
(MIB) is defined that provides a set of common managed 
object definitions. MIB variables defined for, and associated 
with, a particular device can be collected and monitored and 
threshold values, if applicable, can be automatically deter- 
mined for the variables monitored. 

FIG. 1 illustrates a typical environment for an SNMP- 
enabled communications network in which the present 
invention is implemented. Network management station 10 
is an IBM RISC System/6000 computer system running 
under the AIX operating system or a comparable processing 
platform. Although not specifically shown, it includes a 
monitor device to provide graphical and textual interface to 
a network operator, system administrator or other user of the 
network management platform. The network management 
platform depicted in FIG. 1 is the NetView/6000, or its
successor, the NetView for AIX network management platform.
Indicated by reference numeral 12, NetView/6000
provides both topology/database services and an SNMP
application programming interface (API) for the IBM AIX
Router and Bridge Manager/6000 network management
application 14 in which the present invention is embodied.
The topology/database services function provides network 
topology discovery capability; i.e., it determines which
nodes exist in the network. The SNMP API function formats, 
sends and receives SNMP requests and responses over the 
network. 

Logically positioned on top of the NetView/6000 platform
12, the Router and Bridge Manager 14 is launched from
NetView/6000, sends SNMP requests and receives SNMP
responses. SNMP requests, indicated by arrow 35, ask 
particular nodes on the network for information regarding 
specific variables. The set of variables supported by the 
Router and Bridge Manager 14 is a subset of MIB II, a
standard defined by RFC 1213. The SNMP responses 35 sent
by the node that has been polled to the Router and Bridge
Manager 14 contain the values of the variables requested by
the latter.




The variable values so received are collected and grouped
by the data collection module 18 of the Router and Bridge
Manager 14.

Data collection module 18, in conjunction with the automatic
thresholding function 16, then determines the thresholds
for these variables if automatic thresholding is enabled
by the user. Automatic thresholding is performed on a per
variable basis. In other words, it is left to the user's
determination as to which variables are automatically
thresholded. The data collection module 18, after grouping and
thresholding the variables received, then calls upon the user
interface module (not shown) of the Router and Bridge
Manager 14 to display the information to the user and to
propagate threshold statuses appropriately. Router and
Bridge Manager 14 displays information in a color-coded
manner, with green indicating normal values, yellow
indicating marginal values (i.e., values exceeding a first,
lower threshold) and red indicating critical values (i.e.,
values exceeding a second, higher threshold). The colors can
be propagated to the NetView/6000 topology screen to
graphically depict to the operator or user the status of a
particular network resource, as indicated by arrow 15.
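
The color coding described above can be illustrated with a minimal Python sketch. The function names, the strict "greater than" comparison and the rule that a node takes the most severe color among its variables are assumptions made for illustration; the text only states that threshold statuses are propagated "appropriately".

```python
# Hypothetical helper names; only the green/yellow/red meaning comes from the text.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def classify(value, marginal, critical):
    """Return the display color for one sampled value.

    marginal is the first, lower threshold; critical is the second, higher one.
    """
    if value > critical:
        return "red"
    if value > marginal:
        return "yellow"
    return "green"


def worst_status(statuses):
    """Assumed propagation rule: report the most severe color in the group."""
    return max(statuses, key=SEVERITY.__getitem__, default="green")
```

With a marginal threshold of 100 and a critical threshold of 200, a value of 150 would be shown in yellow, and a node holding one yellow and several green variables would itself be reported as yellow under the assumed rule.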

The automatic thresholding invention described herein
also includes scripts (sets of programming instructions)
which enable a user to run any arbitrary function or set of
functions in response to a threshold being exceeded. The
scripts can be customized by the user to deal with a specific
situation occurring. For example, the user can write a script
to have a modem attached to the network management
station 10 dial an emergency beeper number if a node on the
network becomes critical.
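
As a rough sketch of how a user script might be triggered, the Python fragment below runs an external command when a value exceeds its threshold. The function name, the argument order passed to the script and the demo command are hypothetical; the text does not specify how scripts receive their inputs.

```python
import subprocess
import sys


def run_threshold_script(command, node, variable, value, threshold):
    """Run `command` when `value` exceeds `threshold`.

    The node name, variable name, value and threshold are appended as
    command-line arguments (an assumed convention). Returns the script's
    exit code, or None when the threshold was not exceeded.
    """
    if value <= threshold:
        return None
    args = list(command) + [node, variable, str(value), str(threshold)]
    completed = subprocess.run(args, check=False)
    return completed.returncode


# Stand-in for a user-written script (for example, one that dials a pager).
demo_script = [sys.executable, "-c",
               "import sys; print('ALERT', *sys.argv[1:])"]

if __name__ == "__main__":
    run_threshold_script(demo_script, "crouter", "throughput", 4000, 3500)
```

Passing the command as a list avoids shell quoting issues, and `check=False` leaves it to the caller to decide what a non-zero exit code means.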

Router and Bridge Manager 14 can poll and threshold
variables from any reachable node on the SNMP-enabled
network supporting a standard subset of MIB II variables
through SNMP/MIB2 interface function 20. The network
nodes connected to a network management station 10
include workstation 22, bridge 24, router 26 and other
SNMP/MIB-II capable nodes 28. To further illustrate the
environment of the invention, connected to the
aforementioned nodes are token ring LANs 30, 32, 34, 36 and
Ethernet LAN 38.

FIG. 2 is a user screen of the Router and Bridge Manager
14 and identifies all of the nodes that are currently being
polled by this network management application. Color coding
is used to indicate the status of a particular node based
on threshold values of variables. Auto thresholding may or
may not be applied to any of the network nodes depicted in
window 50. Shown in the window are icons for router
("crouter") 52, router ("rremont") 54, and an hourglass icon
56 for a node ("rack") that is in the process of discovery by
the Router and Bridge Manager 14.

The node monitor screen 60 of the Router and Bridge
Manager 14 is shown in FIG. 3. The screen displays the
status of a particular node in the network, in this instance,
router ("crouter") 52. There are three main sections on this
display: general system data 62, interfaces 64 and protocols
66, each of which is divided into "slices"; e.g., total
throughput in the general system data section. Each slice has
a meter that is associated with it as represented by the bar
graphs 63, 65, 67 in FIG. 3. Each meter indicates the last
value collected for that variable, the marginal and critical
thresholds represented by the vertical lines, and the low and
high values (the range) of the meter. The box 71 on the right
side of each slice representation contains an indicator of the
current flags corresponding to that slice. In this box, "auto"
indicates that a slice is being thresholded automatically. This
also implies that the marginal and critical thresholds
displayed on the meter will vary with each new data sample as
the node's variables are polled. Adjacent to "auto" is a
number that indicates the total number of data samples that
have been collected for automatic thresholding.

Each slice has a statistics button 73 associated with it
which, when selected by the user, opens a window of several
more variables which are combined to arrive at an aggregate
status for that slice. Each of these additional variables has a
graphical representation that is similar to the slices depicted
in FIG. 3.

FIG. 4 is a flowchart representing the algorithm
implemented by this invention. The algorithm automatically
determines reasonable threshold values for any particular
data stream and then uses the automatically determined value
to set the thresholds for a counter or gauge without user
intervention. The algorithm is reapplied to each counter or
gauge every time a new data sample is available for that
counter or gauge to set new thresholds. This allows the
thresholds to continuously adapt to changing data. Although
at first the threshold is relatively unstable and prone to wide
variations due to the lack of historical data, these variations
tend to diminish as more and more data samples are
consumed, thus stabilizing the threshold. Assuming a normal
distribution of incoming data samples, the threshold will
eventually converge to a stable value, at which point the
automatic thresholding mechanism may be manually frozen
at the current threshold value for a particular meter or gauge.

In block 400, the next data sample for a particular counter
or gauge is received. In decision block 402, a test is made to
determine if automatic thresholding is enabled for the
monitored counter. If it is, then in decision block 404, a test
is made to determine if there is an imminent value overflow for
the variables associated with the counter that are maintained
for threshold determination. If in decision block 402,
automatic thresholding for the counter is found to be disabled,
the algorithm loops back to logic block 400 to await the
next data sample.

There is a limitation to this algorithm that is inherently
based on the physical limitations of the supporting operating
environment. All operating environments have limits on the
amount of storage which a variable may consume, and it is
possible that one of the variables associated with the counter
for threshold determination may exceed the storage
allocation. Thus, if in decision block 404 a value overflow
condition is found to be imminent, automatic thresholding is
disabled in block 406 and the algorithm loops back to logic
block 400 to await the next data sample. Otherwise, the
algorithm proceeds to logic block 408 where the application
updates a small set of variables associated with each counter
or gauge to be monitored and computes the mean and
standard deviation for the sampled data. The algorithm adds
the current data sample obtained in logic block 400 to the
accumulated sum of previous sample values to arrive at a
running summation S of all data samples collected. The
value of the current data sample is also squared and added
to the sum of the squared values for previous sample values
to arrive at a running summation, SQ, of the square of all
data samples collected. The number of samples collected is
incremented by one and the mean (MEAN) and standard
deviation (SD) of all sampled data are computed.

In logic block 410, the threshold factor, TF, is read for a
particular counter and in logic block 412, the threshold for
the counter is determined by multiplying the threshold factor
by the standard deviation and adding the result to the mean
of the sample values. The current threshold value is the mean
value of the sampled data plus the threshold factor TF
applied to the standard deviation. Different (multiple)
threshold levels can be established by varying the value of
TF. A value of 1.0, for example, sets the threshold to one
standard deviation above the mean, indicating that
approximately 68% of the data sampled falls below the
threshold. Larger values of TF imply larger percentages of
sampled data falling below the threshold.

The threshold generated by the algorithm in logic block
412 is measured in the same units as the sampled data. Logic
block 414 enables normalization of this threshold value
against a base value so that the threshold can represent a
percentage of the base value.

In logic block 416, the threshold for the counter or gauge
is set. The algorithm proceeds to decision block 418 where
a test is made to determine if another threshold needs to be
generated. If another threshold does not need to be
generated, the algorithm loops back to logic block 400 to
await the arrival of the next data sample. To determine
another threshold value, the algorithm loops back from
decision block 418 to logic block 410 to read another
threshold factor. The algorithm can be used to generate
multiple thresholds for a single counter. One value of the
threshold factor TF could be used to establish a warning
level threshold for a counter and a second value (presumably
higher) of TF could be used to establish a critical threshold.

As an example, consider a situation in which the network
management application is monitoring a counter named
"Packets In" for a particular node in the network, and the
following samples have been collected so far (measured in
packets/minute):

    3503, 3488, 3767, 3246, 3345, 3221, 3400, 3050, 3296,
    3006

This counter contains the following values after the last
data sample:

    S  = 33322
    SQ = 111,489,976
    N  = 10

Assume further, that the application needs two thresholds,
one for warning values and one for critical values, and that
the corresponding threshold factors are:

    TF_warning  = 1.0
    TF_critical = 2.0

To derive the thresholds T_warning and T_critical, the
algorithm is applied as follows:

    MEAN = S/N
         = 33322/10
         = 3332.2

    SD = square_root((SQ - (N*MEAN*MEAN))/(N - 1))
       = square_root(50489.7333)
       = 224.70

    T_warning = (MEAN + SD*TF_warning)
              = 3556.9

    T_critical = (MEAN + SD*TF_critical)
               = 3781.6

Based on the historical data, these two thresholds will be set
in the network management application without any
intervention required on the part of the user, and these
thresholds will be applied to the next data sample in
determining the status of the counter or gauge in question.
After this determination, the next data sample is used to
recalculate new thresholds adaptively.
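
The running-sum computation of blocks 404 through 414 can be sketched in Python and checked against the ten "Packets In" samples listed earlier. This is a minimal illustration only: the class and method names are hypothetical, the `limit` parameter merely emulates a fixed-size storage ceiling (Python integers do not overflow), and clamping a slightly negative variance to zero is an added safeguard that the text does not describe.

```python
import math


class RunningThreshold:
    """Keeps S, SQ and N for one counter or gauge; samples are not stored."""

    def __init__(self, limit=2**63 - 1):
        self.s = 0            # running sum of samples (S)
        self.sq = 0           # running sum of squared samples (SQ)
        self.n = 0            # number of samples folded in (N)
        self.limit = limit    # assumed storage ceiling for SQ
        self.enabled = True   # automatic thresholding on or off

    def update(self, sample):
        """Fold one sample into the sums unless thresholding is disabled."""
        if not self.enabled:
            return
        if self.sq + sample * sample > self.limit:
            # Imminent overflow: disable automatic thresholding (block 406).
            self.enabled = False
            return
        self.s += sample
        self.sq += sample * sample
        self.n += 1

    def mean(self):
        return self.s / self.n

    def sd(self):
        """Sample standard deviation from the sums; needs N >= 2."""
        variance = (self.sq - self.n * self.mean() ** 2) / (self.n - 1)
        return math.sqrt(max(variance, 0.0))

    def threshold(self, tf):
        """MEAN + TF * SD, in the same units as the samples (block 412)."""
        return self.mean() + tf * self.sd()

    def threshold_percent(self, tf, base):
        """Threshold expressed as a percentage of a base value (block 414)."""
        return 100.0 * self.threshold(tf) / base


packets_in = RunningThreshold()
for value in [3503, 3488, 3767, 3246, 3345, 3221, 3400, 3050, 3296, 3006]:
    packets_in.update(value)

print(packets_in.s, packets_in.sq, packets_in.n)   # 33322 111489976 10
print(f"{packets_in.threshold(1.0):.1f}")          # 3556.9
print(f"{packets_in.threshold(2.0):.1f}")          # 3781.6
```

Because only three numbers are retained per variable, memory use stays constant however long a node is monitored, at the cost of the overflow limitation the description notes for the sum of squares.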

A graphing utility in Router and Bridge Manager 14
enables the user to display the historical representation of
thresholded variables associated with a network node.
Illustrated in FIG. 5 is a network router's ("crouter") total
traffic over a period of two hours, measured in octets (bytes)
per second. The horizontal line through the middle of the data
displayed is the mean value for the data (851711.12
octets/sec). The region between the horizontal lines above and
below the mean value line displays all data within ±1
standard deviation from the mean value. Any data sample
above this region exceeds the marginal threshold for the
variable displayed. As more data samples arrive, the mean
and standard deviations are updated, thus making the
threshold adaptive. This region will always hold most of the
data samples; in this way, the user is guaranteed that the
thresholds will only be exceeded when node behavior, as
measured by MIB variables, varies significantly from the norm
(mean value) as established by historical values for these
variables.
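
The adaptive behavior can be sketched as a loop that judges each arriving sample against the threshold derived from earlier samples and only then folds it into the history. The function name, the sample data and the two-sample warm-up (needed before a standard deviation exists) are assumptions made for this illustration.

```python
import math


def flag_exceedances(samples, tf=1.0, warmup=2):
    """Return indices of samples above MEAN + tf*SD of the samples before them."""
    s = sq = n = 0
    flagged = []
    for index, x in enumerate(samples):
        if n >= warmup:
            mean = s / n
            variance = (sq - n * mean * mean) / (n - 1)
            threshold = mean + tf * math.sqrt(max(variance, 0.0))
            if x > threshold:
                flagged.append(index)
        # Update the running sums only after the sample has been judged.
        s, sq, n = s + x, sq + x * x, n + 1
    return flagged


traffic = [100, 102, 98, 101, 99, 150, 100]   # made-up values with one spike
print(flag_exceedances(traffic))              # [5]
```

After the spike is absorbed into the sums, the standard deviation widens, so the band around the mean is temporarily more tolerant until further ordinary samples pull it back.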

While the invention has been particularly shown and 
described with reference to the particular embodiment 
thereof, it will be understood by those skilled in the art that 
various changes in form and details may be made therein
without departing from the spirit and scope of the invention. 

Having thus described our invention, what we claim and 
desire to secure as Letters Patent is as follows: 

1. A method for automatically determining and adaptively 
updating thresholds without user intervention in a commu- 
nications network including a network management station 
running a network management application and having at 
least one processor, said network containing a plurality of 
network nodes, said method comprising the steps of: 

selecting at least one variable for each of said plurality of 
network nodes that is to have a threshold value deter- 
mined automatically and adaptively updated; 

receiving a plurality of data sample values for said at least 
one variable that is to have said threshold value deter- 
mined automatically; 

accumulating based on each data sample value received, 
a first sum of each data sample value received and a 
second sum of the square of each data sample value 
received and then discarding said each data sample 
received; 

determining a mean value and a standard deviation for 
said at least one variable based on said first sum, said 
second sum and a total number of data sample values 
received; 

setting at least one threshold value for said at least one 
variable by scaling said standard deviation by a thresh- 
old factor and adding the result to said mean value; and 

graphically displaying a meter depicting said data sample 
value received and said at least one threshold value. 

2. The method of claim 1 wherein said communications 
network is SNMP-enabled and includes at least one network 
router node connected by a communications link to said 
network management station. 

3. The method of claim 1 wherein said communications 
network is SNMP-enabled and includes at least one network
bridge node connected by a communications link to said 
network management station by a network bridge node. 

4. The method of claim 1 wherein a first lower threshold
value for said at least one variable is determined by scaling
using a first threshold factor and a second, higher threshold
value is determined by scaling using a second threshold
factor.

5. The method of claim 4 wherein said displayed meter is
a bar graph representation of said at least one variable and
includes a minimum value, a maximum value, said first
threshold value, said second threshold value and a current
data sample value.

6. The method of claim 1 further comprising the step of
polling each of said plurality of network nodes for an update
on all thresholded variable values monitored at said each
node and transmitting said all thresholded variable values to
said network management station for node status
determination.

7. The method of claim 1 wherein said graphically
displayed data sample value is color encoded based on a
comparison with said at least one threshold value for said at
least one variable.

8. The method of claim 1 further comprising the step of
disabling automatic thresholding and maintaining said at
least one threshold value at a current setting if said second
sum exceeds a maximum number that can be represented by
said at least one processor at said network management
station.

9. The method of claim 1 further comprising executing a
script of programming instructions if the data sample value
received exceeds said at least one threshold value.

10. A system for automatically determining and adaptively
updating thresholds without user intervention for at
least one selected variable at a network node in a
communications network including a network management station
running a network management application and having a
processor, said network containing a plurality of network
nodes, said system comprising:
a transmitter at said network management station for
sending a polling request to each of said plurality of
network nodes to sample a plurality of variables at said
each network node that are thresholded and monitored;
a receiver at said network management station for
receiving a response to said polling request from each of
said plurality of network nodes, said response including
data sample values for each of said plurality of
variables that are thresholded;
accumulating means in said network management
application for said at least one selected variable for storing
and updating a first sum of each data sample value
received, a second sum of the square of each data
sample value received, and a total number of data
sample values received, said each data sample value
then being discarded;
statistics generating means in said network management
application for determining a mean value and a standard
deviation for said at least one selected variable
based on said first sum, said second sum and said total
number of data sample values received;
threshold setting means in said network management
application for scaling said standard deviation and
combining the result with said mean value to set and
adaptively update said at least one threshold value; and
a display for graphically presenting to a user a meter
depicting for said at least one selected variable each
said data sample value received and said at least one
threshold value.

11. The system of claim 10 wherein said communications
network is SNMP-enabled and includes at least one network
router connected by a communications link to said network
management station.

12. The system of claim 10 wherein said communications
network is SNMP-enabled and includes at least one network
bridge connected by a communications link to said network
management station.

13. The system of claim 10 wherein said display represents
said meter associated with said at least one selected
variable with a bar graph, said bar graph depicting a mini- 
mum value, a maximum value, a current data sample value 
and said at least one threshold value. 

14. The system of claim 10 further comprising means in
said network management application for disabling auto- 
matic thresholding and maintaining said at least one thresh- 
old value at a current setting if said second sum exceeds a 
maximum number that can be represented by said processor 
at said network management station. 
